Credit Card Users Segmentation

Created by Lavita Singhania

This project aims to identify potential borrowers and segment them effectively for better credit disbursal. It also provides insights into the creditworthiness of each segment.

This project uses an unsupervised learning algorithm, K-means clustering, to segment the unlabelled data.
The data consists of 1000 customers and 21 attributes for each customer.
Each entry represents a person who takes credit from the bank.

Important variables of the data

  1. Checking Account (text: little, moderate, rich, no checking account)
  2. Duration (numeric: in months)
  3. Credit History (text: No credits taken, All credits paid duly, Existing credits paid duly, Delay in payment, Other)
  4. Purpose (text: car, furniture, radio/TV, repairs, education, vacation, business, others)
  5. Amount (numeric: credit amount)
  6. Saving Account (text: little, moderate, rich, quite rich, no savings account)
  7. Present Employment since (text: unemployed, <1yr, 1 to 4 yrs, 4 to 7 yrs, 7+ yrs)
  8. Installment Rate (numeric: percentage of disposable income)
  9. Sex and Status (text: male & divorced, female & divorced, male & single, male & married, female & single)
  10. Age (numeric: in years)
  11. Housing (text: rent, own, for free)
  12. Job (text: unskilled - non-resident, unskilled - resident, skilled employee, self employed)
  13. Response (categorical: good debt, bad debt)

Project Summary

In [30]:
#Mean values of the numeric variables in each cluster

#Selecting several columns after groupby needs a list (double brackets)
grouped = german_data.groupby('Cluster', as_index=False)[['age', 'duration', 'amount', 'response']].mean().round(2)
print(grouped)

warnings.filterwarnings('ignore')
  Cluster    age  duration   amount  response
0       0  49.78     14.58  2061.84      0.80
1       1  28.39     23.09  2944.75      0.66
2       2  28.91     10.58  1366.42      0.75
3       3  37.95     36.31  7173.58      0.59

Comments:
- Cluster 0 is the group of older customers who take smaller loans for a shorter duration
- Cluster 1 is the group of young customers who take moderate loans for a longer duration
- Cluster 2 is the group of young customers who take very small loans for a very short duration
- Cluster 3 is the group of mid-aged customers who take very large loans for a longer duration

Analysing each cluster, Clusters 0 and 2 have good response rates (0.80 and 0.75), which indicates that users in these clusters are good borrowers and more such users should be targeted. For Cluster 3, as expected with a higher credit amount, the response rate is lower. However, since this segment is the most important to the bank, better due diligence should be followed. For Cluster 1, even though the mean credit amount is only slightly higher than the overall median credit amount, the response rate is pretty low. The loan disbursal process for this segment (younger customers wanting longer-duration loans) needs a lot of improvement.
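
Since response is coded 1 for good debt and 0 for bad debt, one minus the mean response can be read as the share of bad debt in a cluster. Below is a minimal sketch of that arithmetic, using the cluster means from the table printed above. The variable names are illustrative and do not come from the notebook.

```python
# Mean response per cluster, copied from the table printed above
mean_response = {0: 0.80, 1: 0.66, 2: 0.75, 3: 0.59}

# Response is 1 for good debt and 0 for bad debt,
# so the share of bad debt is 1 - mean response
bad_debt_share = {c: round(1 - r, 2) for c, r in mean_response.items()}

# Clusters ordered from lowest to highest share of bad debt
ranking = sorted(bad_debt_share, key=bad_debt_share.get)

for c in ranking:
    print(f"Cluster {c}: {bad_debt_share[c]:.0%} bad debt")
```

Read this way, the ordering of the clusters by credit risk follows directly from the response column.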

Analyzing cluster behavior

In [31]:
plot('job')
In [32]:
plot('present_emp')
In [33]:
plot('sex')
In [34]:
plot('credit_his')

Comments:
- Cluster 0 is a segment of older customers taking small loans for shorter durations. 74% of these customers are either skilled employees or unskilled - residents. They mostly have 7+ yrs of work experience in their current jobs and are single males, and 86% of them either have existing credits paid duly or have other credits existing.
- Cluster 1 is a segment of young customers taking moderate loans for moderate durations. 71% of these customers are skilled employees. 60% of the users have 1-7 yrs of work experience in their current jobs. They are a fairly even mix of single males and married/divorced females, and 83% of them either have existing credits paid duly or have other credits existing.
- Cluster 2 is a segment of young customers taking very small loans for very short durations. 91% of these customers are either skilled employees or unskilled - residents. 66% of these users are just starting their jobs, with either <1yr or 1-4yrs of work experience. They are a mix of single and married males and married/divorced females, and they have a credit history.
- Cluster 3 is a segment of mid-aged customers taking large loans for longer durations. There is a high proportion of self-employed users here, evenly distributed across 1-7yrs of work experience. 70% of them are single males, and they have a credit history, with a significant share having delayed paying their past credits.

Targetable customer base

In [35]:
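#Mean and count of amount and response for each cluster/purpose pair
#(several columns are selected with a list; the group keys stay in the index until reset_index)
grouped2 = german_data.groupby(['Cluster','purpose'])[['amount','response']].agg(['mean','count'])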

grouped3 = grouped2.reset_index()
grouped3.columns = ['Cluster', 'Purpose', 'Mean_Amount','Count', 'Mean_Response', 'Count_Response']
grouped4 = grouped3[['Cluster', 'Purpose', 'Mean_Amount','Count', 'Mean_Response']]
grouped4.head()


fig5 = px.scatter(grouped4, x="Mean_Response", y="Mean_Amount", size="Count", color="Cluster",
                 hover_name="Purpose", size_max=60)
fig5.update_layout(height=600, width=1000, title_text="Credit Worthiness of each segment")
warnings.filterwarnings('ignore')
fig5.show()

Please note: The X-axis here is Mean Response and the Y-axis is Mean Amount.
The size of each bubble indicates the count of users in that cluster and purpose.
The purpose is shown when we hover over the bubble.
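
The three quantities behind each bubble (mean amount, count and mean response per cluster and purpose) can be sketched on a toy table. The rows below are hypothetical and not taken from the German credit data. The named-aggregation form used here is one possible way to get flat column names and is an assumption, not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy rows (not taken from the German credit data)
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1],
    "purpose": ["car (used)", "car (used)", "education", "business", "business"],
    "amount": [2000, 4000, 1500, 3000, 5000],
    "response": [1, 1, 0, 1, 0],
})

# One row per (Cluster, purpose) pair:
# Mean_Amount -> y position, Mean_Response -> x position, Count -> bubble size
summary = (
    toy.groupby(["Cluster", "purpose"])
       .agg(Mean_Amount=("amount", "mean"),
            Count=("amount", "count"),
            Mean_Response=("response", "mean"))
       .reset_index()
)
print(summary)
```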


Comments:
- Ideally, we would want the top right of the above chart to be more populated. To achieve this, below are some of the target areas that the bank can focus on.
- Cluster 0: This segment should be targeted more with used car loans. More marketing effort can be put into used car loans to increase the count of users in this segment. Further, this segment can be targeted with Radio/TV and New Car loans. If users from this segment want to take loans for business, more due diligence is needed to avoid these turning into bad loans.
- Cluster 1: This segment of customers is somewhat lower in importance, since it has lower loan amounts and yet 34% defaulters. This segment can be targeted with lower loan amounts for used cars and business, with more due diligence for Radio/TV and Furniture loans.
- Cluster 2: For this segment, more marketing should be done for Business and Furniture loans to increase the number of users. Retraining, although a very small category, can also be considered a potential target for these users. A large portion of the users in this segment are also interested in new car loans; with better due diligence the response rate can be improved.
- Cluster 3: For this segment, more marketing should be done to attract Radio/TV and used car buyers. This is an important segment for the business, since these customers take higher loan amounts, but it also has defaulters. Hence due diligence needs to be improved for car, business and furniture loans in this segment to mitigate the risk.

Additionally, if the bank is looking to give more automobile loans (for new cars), it should target customers from Clusters 0 and 2 rather than users from Clusters 1 and 3.
For education loans, Cluster 0, which has significantly better response rates, should be targeted, and for furniture/equipment loans, Clusters 1 and 3 should be targeted rather than Clusters 0 and 2.

Project Working

Loading libraries

In [3]:
#Data manipulation libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

#Data visualization libraries
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots

import matplotlib.pyplot as plt 
import seaborn as sns

#K-means clustering libraries 
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Data processing

In [4]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
credit = pd.read_csv(url, sep= " ", names = ["chk_acct", "duration", "credit_his", "purpose","amount", 
                                             "saving_acct", "present_emp", "installment_rate", "sex", 
                                             "other_debtor", "present_resid", "property", "age", 
                                             "other_install", "housing", "n_credits", "job", "n_people", 
                                             "telephone", "foreign","response"])
In [5]:
credit.head()
Out[5]:
chk_acct duration credit_his purpose amount saving_acct present_emp installment_rate sex other_debtor ... property age other_install housing n_credits job n_people telephone foreign response
0 A11 6 A34 A43 1169 A65 A75 4 A93 A101 ... A121 67 A143 A152 2 A173 1 A192 A201 1
1 A12 48 A32 A43 5951 A61 A73 2 A92 A101 ... A121 22 A143 A152 1 A173 1 A191 A201 2
2 A14 12 A34 A46 2096 A61 A74 2 A93 A101 ... A121 49 A143 A152 1 A172 2 A191 A201 1
3 A11 42 A32 A42 7882 A61 A74 2 A93 A103 ... A122 45 A143 A153 1 A173 2 A191 A201 1
4 A11 24 A33 A40 4870 A61 A73 3 A93 A101 ... A124 53 A143 A153 2 A173 2 A191 A201 2

5 rows × 21 columns

In [6]:
credit.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   chk_acct          1000 non-null   object
 1   duration          1000 non-null   int64 
 2   credit_his        1000 non-null   object
 3   purpose           1000 non-null   object
 4   amount            1000 non-null   int64 
 5   saving_acct       1000 non-null   object
 6   present_emp       1000 non-null   object
 7   installment_rate  1000 non-null   int64 
 8   sex               1000 non-null   object
 9   other_debtor      1000 non-null   object
 10  present_resid     1000 non-null   int64 
 11  property          1000 non-null   object
 12  age               1000 non-null   int64 
 13  other_install     1000 non-null   object
 14  housing           1000 non-null   object
 15  n_credits         1000 non-null   int64 
 16  job               1000 non-null   object
 17  n_people          1000 non-null   int64 
 18  telephone         1000 non-null   object
 19  foreign           1000 non-null   object
 20  response          1000 non-null   int64 
dtypes: int64(8), object(13)
memory usage: 164.2+ KB

Data cleaning


In [7]:
#Work on a copy so the raw coded data in credit stays untouched
german_data = credit.copy()

#Cleaning data: map the coded values of each column to readable labels
code_maps = {
    'chk_acct': {'A11': 'little', 'A12': 'moderate', 'A13': 'rich', 'A14': 'no checking account'},
    'credit_his': {'A30': 'no credits taken',
                   'A31': 'all credits paid duly',
                   'A32': 'existing credits paid duly',
                   'A33': 'delay in paying off in the past',
                   'A34': 'other credits existing'},
    'purpose': {'A40': 'car (new)', 'A41': 'car (used)', 'A42': 'furniture/equipment',
                'A43': 'radio/television', 'A44': 'domestic appliances', 'A45': 'repairs',
                'A46': 'education', 'A47': 'vacation', 'A48': 'retraining',
                'A49': 'business', 'A410': 'others'},
    'saving_acct': {'A61': 'little', 'A62': 'moderate', 'A63': 'rich',
                    'A64': 'quite rich', 'A65': 'no savings account'},
    'present_emp': {'A71': 'unemployed', 'A72': '< 1yr', 'A73': '>=1yr & <4yr',
                    'A74': '>=4yr & <7yr', 'A75': '>=7yr'},
    'sex': {'A91': 'male : divorced', 'A92': 'female : divorced/married',
            'A93': 'male : single', 'A94': 'male : married/widowed', 'A95': 'female : single'},
    'other_debtor': {'A101': 'none', 'A102': 'co-applicant', 'A103': 'guarantor'},
    'property': {'A121': 'real estate', 'A122': 'building/life insurance',
                 'A123': 'car/others', 'A124': 'unknown/no property'},
    'other_install': {'A141': 'bank', 'A142': 'stores', 'A143': 'none'},
    'housing': {'A151': 'rent', 'A152': 'own', 'A153': 'free'},
    'job': {'A171': 'unskilled - non-resident', 'A172': 'unskilled - resident',
            'A173': 'skilled employee', 'A174': 'self-employed'},
    'telephone': {'A191': 'none', 'A192': 'yes'},
    'foreign': {'A201': 'yes', 'A202': 'no'},
    'response': {1: 1, 2: 0},  #1 = good debt, 0 = bad debt
}

#Assign the result back instead of calling replace(..., inplace=True) on a column
for col, mapping in code_maps.items():
    german_data[col] = german_data[col].replace(mapping)

german_data.head()
Out[7]:
chk_acct duration credit_his purpose amount saving_acct present_emp installment_rate sex other_debtor ... property age other_install housing n_credits job n_people telephone foreign response
0 little 6 other credits existing radio/television 1169 no savings account >=7yr 4 male : single none ... real estate 67 none own 2 skilled employee 1 yes yes 1
1 moderate 48 existing credits paid duly radio/television 5951 little >=1yr & <4yr 2 female : divorced/married none ... real estate 22 none own 1 skilled employee 1 none yes 0
2 no checking account 12 other credits existing education 2096 little >=4yr & <7yr 2 male : single none ... real estate 49 none own 1 unskilled - resident 2 none yes 1
3 little 42 existing credits paid duly furniture/equipment 7882 little >=4yr & <7yr 2 male : single guarantor ... building/life insurance 45 none free 1 skilled employee 2 none yes 1
4 little 24 delay in paying off in the past car (new) 4870 little >=1yr & <4yr 3 male : single none ... unknown/no property 53 none free 2 skilled employee 2 none yes 0

5 rows × 21 columns


Checking for missing values


In [8]:
#Checking missing values in the data 
print("Missing values in each column:\n{}".format(german_data.isnull().sum()))
Missing values in each column:
chk_acct            0
duration            0
credit_his          0
purpose             0
amount              0
saving_acct         0
present_emp         0
installment_rate    0
sex                 0
other_debtor        0
present_resid       0
property            0
age                 0
other_install       0
housing             0
n_credits           0
job                 0
n_people            0
telephone           0
foreign             0
response            0
dtype: int64

Comments: No missing values in the data

Exploratory analysis

In [9]:
german_data[["amount", "duration", "age"]].describe().round(2)
Out[9]:
amount duration age
count 1000.00 1000.00 1000.00
mean 3271.26 20.90 35.55
std 2822.74 12.06 11.38
min 250.00 4.00 19.00
25% 1365.50 12.00 27.00
50% 2319.50 18.00 33.00
75% 3972.25 24.00 42.00
max 18424.00 72.00 75.00

Comments:
- The average credit amount taken by users is ~3.3K, with an average duration of ~21 months and an average age of ~36 yrs
- There is very high variation in the credit amount (standard deviation of ~2.8K)
- The large gap between the mean and the median of the credit amount indicates that amount is highly positively skewed, whereas duration and age are slightly positively skewed
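
One rough way to put a number on the gap between mean and median is Pearson's second skewness coefficient, 3 × (mean − median) / std. This measure is brought in here for illustration only; the notebook itself does not compute it. The sketch below applies it to the summary statistics from the describe() output above.

```python
# Summary statistics copied from the describe() output above
stats = {
    "amount":   {"mean": 3271.26, "median": 2319.50, "std": 2822.74},
    "duration": {"mean": 20.90,   "median": 18.00,   "std": 12.06},
    "age":      {"mean": 35.55,   "median": 33.00,   "std": 11.38},
}

def pearson_median_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std

skewness = {name: pearson_median_skewness(**s) for name, s in stats.items()}

for name, value in skewness.items():
    print(f"{name}: {value:.2f}")
```

All three coefficients come out positive, with amount the largest, which agrees with the direction of the comments above. On the full data, pandas' `DataFrame.skew()` would give the moment-based skewness instead.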

Distribution plots

In [10]:
#Distribution Plots

fig = make_subplots(rows=1, cols=3)

trace0 = go.Histogram(x=german_data["amount"], name='Credit Amount Distribution')
trace1 = go.Histogram(x=german_data["duration"], name="Duration Distribution")
trace2 = go.Histogram(x=german_data["age"], name="Age Distribution")

fig.append_trace(trace0,1,1)
fig.append_trace(trace1,1,2)
fig.append_trace(trace2,1,3)

#Updating xaxes and yaxes

fig.update_xaxes(title_text="Credit Amount", row=1, col=1)
fig.update_xaxes(title_text="Duration", row=1, col=2)
fig.update_xaxes(title_text="Age", row=1, col=3)

fig.update_yaxes(title_text="Count", row=1, col=1)

fig.update_layout(height=500, width=1000, title_text="Distribution Plots")
fig.show()

Comments:
- The data shows that most credit amounts lie between roughly 1,500 and 4,000
- As expected, credit amount is positively skewed which indicates that more people take smaller amounts of loan
- The duration and age distribution are also slightly positively skewed

Boxplots

In [11]:
#Distribution of credit amount across various reasons 

fig2 = px.box(german_data, x="purpose", y="amount", color="response", title="Distribution of credit amount across various reasons")

fig2.update_layout(height=500, width=1000)
fig2.show()

Please Note:
Response 1 = Good Debt (in blue)
Response 0 = Bad Debt (in red)


Comments:
- To define a maximum loan amount that helps prevent bad debt, the maximum non-outlier value of good debt can be used as a benchmark for each purpose. For example, for education the maximum loan amount can be set at around 8K, beyond which the probability of it becoming a bad debt increases.
- Used cars have the highest median loan amount. The spread of the used car distribution is also wider than that of all the other purposes.
- For automobile loans, customers looking for a used car should be targeted more, as they take loans of significantly higher amounts. However, there should be more due diligence, as used car borrowers also default more.
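
The "maximum non-outlier value" in a box plot is the end of the upper whisker: the largest observation that does not exceed Q3 + 1.5 × IQR. A minimal sketch of that rule follows. The loan amounts are hypothetical, the function name is illustrative, and the quartile method used here may differ slightly from the one plotly applies.

```python
from statistics import quantiles

def max_non_outlier(values):
    """Largest value that is not above Q3 + 1.5 * IQR (the upper box-plot fence)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    upper_fence = q3 + 1.5 * (q3 - q1)
    return max(v for v in values if v <= upper_fence)

# Hypothetical good-debt loan amounts for one purpose (not real data)
good_debt_amounts = [1000, 1500, 2000, 2500, 3000, 3500, 4000, 12000]

benchmark = max_non_outlier(good_debt_amounts)
print(benchmark)
```

Here the 12000 loan lies above the fence, so the benchmark falls back to the largest remaining amount.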

Scatter matrix

In [12]:
#Scatter plots
import plotly.figure_factory as ff

fig3 = ff.create_scatterplotmatrix(german_data[['age','duration','amount']], diag='box', 
                                   #index='index',
                                   colormap='Portland',colormap_type='cat',height=700, width=700)
fig3.show()

Comments: As expected, there is a fairly strong positive correlation between duration and credit amount, with a correlation coefficient of about 0.62
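
For reference, the Pearson correlation coefficient can be computed as sketched below. The sketch uses only the five rows shown in the head() output earlier, so it illustrates the formula and does not reproduce the full-data coefficient. On the full data, `german_data[['age','duration','amount']].corr()` would be the pandas equivalent.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# First five rows of the data, as shown in the head() output (illustration only)
duration = [6, 48, 12, 42, 24]
amount = [1169, 5951, 2096, 7882, 4870]

r = pearson_r(duration, amount)
print(round(r, 2))
```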

K - Means Clustering for Segmentation

Data Preprocessing for clustering

In [13]:
# Data for K-means clustering
german_data_cluster = german_data[['age', 'amount', 'duration']]
print("Original variables:\n{}" .format(german_data_cluster.head()))
      
german_data_cluster_tr = np.log(german_data_cluster);
print("Log transformed variables:\n{}" .format(german_data_cluster_tr.head()))
Original variables:
   age  amount  duration
0   67    1169         6
1   22    5951        48
2   49    2096        12
3   45    7882        42
4   53    4870        24
Log transformed variables:
        age    amount  duration
0  4.204693  7.063904  1.791759
1  3.091042  8.691315  3.871201
2  3.891820  7.647786  2.484907
3  3.806662  8.972337  3.737670
4  3.970292  8.490849  3.178054

Comments: As seen in the distribution plots, our variables are positively skewed. Hence, we apply a logarithmic transformation to the variables to reduce the skewness.
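
A quick way to see why the logarithm helps is to compare the length of the upper tail with that of the lower tail before and after the transformation. The sketch below does this for the credit amount, using the minimum, median and maximum from the describe() output above. The `tail_ratio` helper is illustrative and not part of the notebook.

```python
import math

# Minimum, median and maximum credit amount, copied from describe() above
amount_min, amount_median, amount_max = 250.0, 2319.5, 18424.0

def tail_ratio(low, mid, high):
    """Length of the upper tail relative to the lower tail; 1 means symmetric."""
    return (high - mid) / (mid - low)

raw_ratio = tail_ratio(amount_min, amount_median, amount_max)
log_ratio = tail_ratio(*(math.log(v) for v in (amount_min, amount_median, amount_max)))

print(f"raw: {raw_ratio:.2f}, log: {log_ratio:.2f}")
```

On the raw scale the upper tail is several times longer than the lower one, while on the log scale the two are of similar length.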

In [14]:
#Distribution of transformed variable
fig, ax = plt.subplots(1,3,figsize=(20,5))
plt.suptitle('DISTRIBUTION PLOTS OF TRANSFORMED VARIABLES')
sns.distplot(german_data_cluster_tr['amount'], bins=40, ax=ax[0], axlabel="Credit Amount");
sns.distplot(german_data_cluster_tr['duration'], bins=40, ax=ax[1], color='salmon', axlabel="Duration");
sns.distplot(german_data_cluster_tr['age'], bins=40, ax=ax[2], color='darkviolet', axlabel="Age");

Comments: After the transformation, the variables look much closer to normally distributed

Finding the optimal number of clusters using the elbow method

In [15]:
scaler = StandardScaler()
german_data_cluster_scaled = scaler.fit_transform(german_data_cluster_tr)
In [16]:
#Using the elbow method to define the optimal number of clusters
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(german_data_cluster_scaled)
    distortions.append(kmeanModel.inertia_)
    
#Plotting the elbow method 

plt.figure(figsize=(10,5))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
Out[16]:
Text(0.5, 1.0, 'The Elbow Method showing the optimal k')

Comments: There is hardly any change in distortion after k=4, hence we will consider the optimal number of clusters to be 4
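
The elbow can also be read numerically, as the point after which the relative drop in inertia becomes small. The sketch below illustrates this on synthetic data with four well-separated groups. It does not use the credit data, and `n_init=10` is set explicitly as an assumption so that the result does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups (not the credit data)
centers = np.array([[0, 0], [0, 10], [10, 0], [10, 10]])
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

ks = list(range(1, 10))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Relative drop in inertia when moving from k to k + 1
drops = [(inertias[i] - inertias[i + 1]) / inertias[i] for i in range(len(inertias) - 1)]

for k, drop in zip(ks[:-1], drops):
    print(f"k={k} -> k={k + 1}: inertia falls by {drop:.0%}")
```

The same calculation could be applied to the `distortions` list from the cell above to check that the drop after k=4 is small.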

In [17]:
# k-means algorithm
k = 4
kmeans = KMeans(n_clusters=k, random_state=0).fit(german_data_cluster_scaled)

#Adding the clusters back to the data
german_data['Cluster'] = kmeans.labels_
german_data['Cluster'] = german_data['Cluster'].astype('category')
In [18]:
german_data.head()
Out[18]:
chk_acct duration credit_his purpose amount saving_acct present_emp installment_rate sex other_debtor ... age other_install housing n_credits job n_people telephone foreign response Cluster
0 little 6 other credits existing radio/television 1169 no savings account >=7yr 4 male : single none ... 67 none own 2 skilled employee 1 yes yes 1 0
1 moderate 48 existing credits paid duly radio/television 5951 little >=1yr & <4yr 2 female : divorced/married none ... 22 none own 1 skilled employee 1 none yes 0 3
2 no checking account 12 other credits existing education 2096 little >=4yr & <7yr 2 male : single none ... 49 none own 1 unskilled - resident 2 none yes 1 0
3 little 42 existing credits paid duly furniture/equipment 7882 little >=4yr & <7yr 2 male : single guarantor ... 45 none free 1 skilled employee 2 none yes 1 3
4 little 24 delay in paying off in the past car (new) 4870 little >=1yr & <4yr 3 male : single none ... 53 none free 2 skilled employee 2 none yes 0 3

5 rows × 22 columns

Cluster visualization

In [19]:
#3D scatter plot of the cluster

fig4 = px.scatter_3d(german_data, x='age', y='duration', z='amount',color='Cluster')
fig4.update_layout(height=500, width=700, title_text="3D Map of Cluster")
fig4.show()
In [20]:
#Cluster sizes

cluster_size = german_data.groupby('Cluster', as_index=True).size()
cluster_size
Out[20]:
Cluster
0    231
1    303
2    248
3    218
dtype: int64

Comments: The customers are fairly evenly divided among the clusters, with cluster sizes ranging from 218 to 303

Other Insights on Cluster Behavior

In [21]:
#Countplots to identify segments of customers

sns.set_style('white') 
fig, ax = plt.subplots(3,2,figsize=(21,20))
plt.suptitle('COUNT PLOTS',fontsize=15)
#Pass the column as x explicitly; newer seaborn versions no longer accept it positionally
sns.countplot(x=german_data['job'], ax=ax[0][0], palette=sns.color_palette('RdBu'))
sns.countplot(x=german_data['housing'], ax=ax[0][1], palette=sns.color_palette('RdBu'))
sns.countplot(x=german_data['saving_acct'], ax=ax[1][0], palette=sns.color_palette('BuGn_r'))
sns.countplot(x=german_data['chk_acct'], ax=ax[1][1],palette=sns.color_palette('BuGn_r')[4:])
sns.countplot(x=german_data['purpose'], ax=ax[2][0], palette=sns.color_palette('RdBu_r'))
sns.countplot(x=german_data['sex'], ax=ax[2][1],palette=sns.color_palette('RdBu_r'))

ax[2][0].tick_params(labelrotation=45)
ax[0][0].set(xlabel="Job", ylabel="Count")
ax[0][1].set(xlabel="Housing", ylabel="Count")
ax[1][0].set(xlabel="Saving Accounts", ylabel="Count")
ax[1][1].set(xlabel="Checking Accounts", ylabel="Count")
ax[2][0].set(xlabel="Purpose", ylabel="Count")
ax[2][1].set(xlabel="Sex and Status", ylabel="Count")
Out[21]:
[Text(0, 0.5, 'Count'), Text(0.5, 0, 'Sex and Status')]
In [22]:
#Distribution of age in each cluster

cluster0 = german_data[german_data['Cluster']==0]
cluster1 = german_data[german_data['Cluster']==1]
cluster2 = german_data[german_data['Cluster']==2]
cluster3 = german_data[german_data['Cluster']==3]

fig, ax = plt.subplots(4,1,figsize=(10,6), constrained_layout=True, sharex=True)
ax[0].title.set_text('Cluster 0')
ax[1].title.set_text('Cluster 1')
ax[2].title.set_text('Cluster 2')
ax[3].title.set_text('Cluster 3')
ax[0].axes.xaxis.set_visible(False)
ax[1].axes.xaxis.set_visible(False)
ax[2].axes.xaxis.set_visible(False)
sns.distplot(cluster0['age'], color='darkcyan', bins=10, ax=ax[0])
sns.distplot(cluster1['age'], color='steelblue', bins=10, ax=ax[1])
sns.distplot(cluster2['age'], color='sandybrown', bins=10, ax=ax[2])
sns.distplot(cluster3['age'], color='indianred', bins=10, ax=ax[3])
plt.xlabel('Age', fontsize=20)
Out[22]:
Text(0.5, 0, 'Age')
In [23]:
#Distribution of credit amounts in each cluster 

fig, ax = plt.subplots(4,1,figsize=(10,6), constrained_layout=True, sharex=True)
ax[0].title.set_text('Cluster 0')
ax[1].title.set_text('Cluster 1')
ax[2].title.set_text('Cluster 2')
ax[3].title.set_text('Cluster 3')
ax[0].axes.xaxis.set_visible(False)
ax[1].axes.xaxis.set_visible(False)
ax[2].axes.xaxis.set_visible(False)

sns.distplot(cluster0['amount'], color='darkcyan', bins=10, ax=ax[0])
sns.distplot(cluster1['amount'], color='steelblue', bins=10, ax=ax[1])
sns.distplot(cluster2['amount'], color='sandybrown', bins=10, ax=ax[2])
sns.distplot(cluster3['amount'], color='indianred', bins=10, ax=ax[3])
plt.xlabel('Credit Amount', fontsize=20)
Out[23]:
Text(0.5, 0, 'Credit Amount')
In [24]:
#Distribution of duration in each cluster

fig, ax = plt.subplots(4,1,figsize=(10,6), constrained_layout=True, sharex=True)
ax[0].title.set_text('Cluster 0')
ax[1].title.set_text('Cluster 1')
ax[2].title.set_text('Cluster 2')
ax[3].title.set_text('Cluster 3')
ax[0].axes.xaxis.set_visible(False)
ax[1].axes.xaxis.set_visible(False)
ax[2].axes.xaxis.set_visible(False)

sns.distplot(cluster0['duration'], color='darkcyan', bins=10, ax=ax[0])
sns.distplot(cluster1['duration'], color='steelblue', bins=10, ax=ax[1])
sns.distplot(cluster2['duration'], color='sandybrown', bins=10, ax=ax[2])
sns.distplot(cluster3['duration'], color='indianred', bins=10, ax=ax[3])
plt.xlabel('Duration', fontsize=20)
Out[24]:
Text(0.5, 0, 'Duration')
In [25]:
#Defining a function to create plot

def get_df(data):
    #Share of each category; the columns are named explicitly so the result
    #does not depend on the pandas version
    out = data.value_counts(normalize=True).rename_axis('index').reset_index(name='share')
    return out

def plot(x):
    #Grouped bar chart of the category shares of variable x in each cluster
    clusters = [('Cluster 0', cluster0, 'mediumaquamarine'),
                ('Cluster 1', cluster1, 'steelblue'),
                ('Cluster 2', cluster2, 'sandybrown'),
                ('Cluster 3', cluster3, 'indianred')]
    fig = go.Figure()
    for name, cluster, color in clusters:
        shares = get_df(cluster[x])
        fig.add_trace(go.Bar(
            x=shares['index'],
            y=shares['share'],
            name=name,
            marker_color=color
        ))

    fig.update_layout(barmode='group', xaxis_tickangle=45, title=x)
    fig.show()
In [26]:
plot('saving_acct')
In [27]:
plot('installment_rate')
In [28]:
plot('housing')
In [29]:
plot('n_credits')